
This survey provides a comprehensive overview of the evaluation of Large Language Models (LLMs). The authors discuss the importance of evaluation in the advancement of AI models and organize the field around three key questions: what to evaluate, where to evaluate, and how to evaluate.

In terms of what to evaluate, the authors catalogue the success and failure cases of LLMs across tasks. LLMs generate fluent text and perform strongly on language understanding tasks such as sentiment analysis and text classification, and they also do well on arithmetic reasoning, logical reasoning, and question answering. However, they struggle to accurately represent human disagreement, show limited ability in discerning semantic similarity and in abstract reasoning, and can produce biased or toxic outputs.
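To make task-level evaluation concrete, the sketch below scores a model on a handful of question-answer pairs by exact-match accuracy, the simplest automatic metric for tasks with short reference answers. The `query_model` callable and the toy arithmetic items are hypothetical placeholders for illustration, not part of the survey.

```python
# Minimal sketch of task-level evaluation by exact-match accuracy.
# `model` stands in for a hypothetical LLM wrapper that maps a prompt
# to a text answer; the arithmetic items below are toy examples.
from typing import Callable, List, Tuple

def exact_match_accuracy(
    model: Callable[[str], str],
    examples: List[Tuple[str, str]],
) -> float:
    """Score (prompt, reference) pairs by case-insensitive exact match."""
    if not examples:
        return 0.0
    correct = sum(
        model(prompt).strip().lower() == reference.strip().lower()
        for prompt, reference in examples
    )
    return correct / len(examples)

items = [
    ("What is 17 + 25? Answer with the number only.", "42"),
    ("What is 9 * 8? Answer with the number only.", "72"),
]
# accuracy = exact_match_accuracy(query_model, items)
```

Real benchmarks add answer normalization and far more items, but their scoring loops have this basic shape.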

In terms of where to evaluate, the authors survey the benchmarks that have been developed for LLMs. These include domain- and application-specific benchmarks such as ARB, TRUSTGPT, EmotionBench, and SafetyBench, as well as benchmarks for languages other than English, such as C-Eval and CMMLU for Chinese. In terms of how to evaluate, they discuss automatic evaluation protocols and also stress the importance of human evaluation in assessing the quality and accuracy of model-generated results.
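As a rough illustration of the human-evaluation side, the snippet below aggregates ratings from several annotators into a per-output score and a model-level score. The 1-to-5 scale, the annotator counts, and the data are illustrative assumptions, not a protocol prescribed by the survey.

```python
# Minimal sketch: aggregate human ratings of model outputs.
# Assumes each output is rated on a 1-5 scale by several annotators;
# the data below is purely illustrative.
from statistics import mean

ratings = {
    "output_1": [4, 5, 4],  # scores from three hypothetical annotators
    "output_2": [2, 3, 2],
}

# Average per output, then average across outputs for a model-level score.
per_output = {output_id: mean(scores) for output_id, scores in ratings.items()}
model_score = mean(per_output.values())
print(per_output, round(model_score, 2))
```

Published human evaluations typically also report inter-annotator agreement and use detailed rubrics, but the aggregation step usually reduces to averaging of this kind.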

Looking beyond current practice, the authors discuss the shift from static to dynamic evaluation and the need for evaluation systems that can adapt and evolve alongside LLMs. They also emphasize the importance of designing AGI (Artificial General Intelligence) benchmarks to measure the capabilities of LLMs, of evaluations that test the robustness of LLMs against a wide variety of inputs, and of keeping evaluation principled and trustworthy.
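As a sketch of what a simple robustness check might look like, the snippet below compares a model's answers on original prompts against lightly perturbed variants and reports the fraction that stay unchanged. The `query_model` wrapper, the character-swap perturbation, and the consistency metric are illustrative assumptions rather than methods taken from the survey.

```python
# Minimal robustness probe: does the answer survive a small input perturbation?
from typing import Callable, List

def perturb(prompt: str) -> str:
    """Toy perturbation: swap the first two characters of the prompt."""
    return prompt[1] + prompt[0] + prompt[2:] if len(prompt) > 1 else prompt

def consistency_rate(model: Callable[[str], str], prompts: List[str]) -> float:
    """Fraction of prompts whose answer is unchanged under perturbation."""
    if not prompts:
        return 0.0
    stable = sum(model(p).strip() == model(perturb(p)).strip() for p in prompts)
    return stable / len(prompts)

# Usage (with a hypothetical LLM wrapper):
# score = consistency_rate(query_model, ["What is the capital of France?"])
```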

The authors conclude by framing these themes as grand challenges and opportunities for future research: designing AGI benchmarks, developing robustness evaluations, creating dynamic and evolving evaluation systems, ensuring principled and trustworthy evaluation, and building a unified evaluation system that supports all LLM tasks. They also note that evaluation is not the end goal but the starting point, and emphasize using evaluation results to enhance LLMs and to drive future research and development.

Overall, this survey provides a comprehensive overview of the evaluation of LLMs and identifies key challenges and opportunities for future research. By summarizing existing efforts and highlighting areas for improvement, the authors contribute to the advancement of the evaluation discipline in the context of LLMs.
